BioConductor - Data Extraction

This notebook retrieves the useful data that BioConductor makes available. Specifically, it collects the meta-data of the R packages for a given set of BioConductor versions.


In [8]:
import pandas
import requests
import json
import BeautifulSoup as bs

from datetime import date
from itertools import repeat

We will retrieve a lot of data, so we can benefit from IPython's parallel computing tools.

To use this notebook, you need to either configure your IPController or start a cluster of IPython engines, for example with ipcluster start -n 4. See https://ipython.org/ipython-doc/dev/parallel/parallel_process.html for more information.

It seems that recent versions of IPython Notebook can start a cluster directly from the web interface, under the Clusters tab.


In [9]:
from IPython import parallel
clients = parallel.Client()
clients.block = True # synchronous computations
print 'Clients:', str(clients.ids)


Clients: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

We first define a set of constants that store the URLs we need. BioConductor splits its packages into three categories (Software, AnnotationData and ExperimentData), so each constant stores one URL template per category.


In [10]:
CATEGORIES = ['Software', 'AnnotationData', 'ExperimentData']

# Human-readable package lists. Not used in this notebook, but handy for browsing the packages in a pretty format.
BASE_LIST = {
    'Software': 'http://bioconductor.org/packages/{version}/BiocViews.html#___Software',
    'AnnotationData': 'http://bioconductor.org/packages/{version}/BiocViews.html#___AnnotationData',
    'ExperimentData': 'http://bioconductor.org/packages/{version}/BiocViews.html#___ExperimentData'
}

# JSON lists that will be parsed. BioConductor uses them to populate the pages in BASE_LIST.
JSON_LIST = {
    'Software': 'http://bioconductor.org/packages/json/{version}/bioc/packages.js',
    'AnnotationData': 'http://bioconductor.org/packages/json/{version}/data/annotation/packages.js',
    'ExperimentData': 'http://bioconductor.org/packages/json/{version}/data/experiment/packages.js'
}

# Details page for every package. 
PACKAGE_DETAILS = {
    'Software': 'http://bioconductor.org/packages/{version}/bioc/html/{name}.html',
    'AnnotationData': 'http://bioconductor.org/packages/{version}/data/annotation/html/{name}.html',
    'ExperimentData': 'http://bioconductor.org/packages/{version}/data/experiment/html/{name}.html'
}

# Available versions and the corresponding date.
VERSIONS = [
    ('1.6', '2005-05-18'),
    ('1.7', '2005-10-14'),
    ('1.8', '2006-04-27'),
    ('1.9', '2006-10-04'),
    ('2.0', '2007-04-26'),
    ('2.1', '2007-10-08'),
    ('2.2', '2008-05-01'),
    ('2.3', '2008-10-22'),
    ('2.4', '2009-04-21'),
    ('2.5', '2009-10-28'),
    ('2.6', '2010-04-23'),
    ('2.7', '2010-10-18'),
    ('2.8', '2011-04-14'),
    ('2.9', '2011-11-01'),
    ('2.10', '2012-04-02'),
    ('2.11', '2012-10-03'),
    ('2.12', '2013-04-04'),
    ('2.13', '2013-10-15'),
    ('2.14', '2014-04-14'),
    ('3.0', '2014-10-14'),
    # ('3.1', '2015-04-17'),
]
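# The pages of versions before 2.5 do not have the same structure, so we skip them.
# The dates are ISO 8601 strings, so comparing them as strings is chronological.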
VERSIONS = filter(lambda x: x[1] >= '2009-10-28', VERSIONS)

# Meta-data we're interested in.
METADATA = ['Version', 'License', 'Depends', 'Imports', 'Suggests']

# Output
FILENAME = '../data/bioconductor-{date}.csv'.format(date=date.today().isoformat())
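The filter on VERSIONS compares release dates rather than version numbers. A minimal sketch of why this matters, using a shortened copy of the list; version_key is a hypothetical helper that does not appear in the notebook:

```python
# Shortened copy of the VERSIONS list, for illustration only.
VERSIONS = [
    ('2.4', '2009-04-21'),
    ('2.5', '2009-10-28'),
    ('2.10', '2012-04-02'),
    ('3.0', '2014-10-14'),
]

# ISO 8601 dates sort chronologically when compared as plain strings.
kept = [(v, d) for v, d in VERSIONS if d >= '2009-10-28']

# Comparing version strings directly is wrong: '2.10' sorts before '2.5'.
naive = [(v, d) for v, d in VERSIONS if v >= '2.5']


def version_key(version):
    """Hypothetical helper: turn '2.10' into (2, 10) for numeric comparison."""
    return tuple(int(part) for part in version.split('.'))


by_version = [(v, d) for v, d in VERSIONS if version_key(v) >= (2, 5)]

print([v for v, _ in kept])
print([v for v, _ in naive])
print([v for v, _ in by_version])
```

The list comprehension also returns a list on both Python 2 and Python 3, whereas filter returns a one-shot iterator on Python 3.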

In [11]:
def metadata_for_packages(category, name, version):
    """
    Return the meta-data of this package, restricted to the keys listed in METADATA.
    """
    try:
        content = requests.get(PACKAGE_DETAILS[category].format(version=version, name=name)).content
        soup = bs.BeautifulSoup(content)
        table = soup.find(name='table', attrs={'class': 'details'})
        data = {}
        for row in table.findChildren('tr'):
            key, value = row.findChildren('td')
            if key.text in METADATA:
                data[key.text] = value.text
        return data
    except Exception: 
        print 'Exception while working on', name, version, 'in', category
        raise
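To make the parsing step concrete, here is a minimal offline sketch of the same extraction. It assumes Python 3 and swaps BeautifulSoup for the standard library's html.parser so that it runs without extra dependencies. SAMPLE_HTML, DetailsTableParser and extract_metadata are hypothetical names, and the sample markup is hand-written rather than copied from a real BioConductor page. Unlike the function above, it skips rows that do not hold exactly two cells instead of raising.

```python
from html.parser import HTMLParser

METADATA = ['Version', 'License', 'Depends', 'Imports', 'Suggests']

# Hand-written stand-in for a package details page (an assumption).
SAMPLE_HTML = """
<html><body>
<table class="details">
  <tr><td>Version</td><td>1.2.3</td></tr>
  <tr><td>License</td><td>GPL-2</td></tr>
  <tr><td>Depends</td><td>R, methods</td></tr>
  <tr><td>Maintainer</td><td>Jane Doe</td></tr>
</table>
</body></html>
"""


class DetailsTableParser(HTMLParser):
    """Collect the text of each cell, row by row, inside table.details."""

    def __init__(self):
        super().__init__()
        self.in_details = False
        self.in_cell = False
        self.rows = []
        self._cells = None  # cells of the row being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == 'table' and ('class', 'details') in attrs:
            self.in_details = True
        elif self.in_details and tag == 'tr':
            self._cells = []
        elif self.in_details and tag == 'td' and self._cells is not None:
            self._cells.append('')
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == 'table':
            self.in_details = False
        elif tag == 'td':
            self.in_cell = False
        elif tag == 'tr' and self._cells is not None:
            self.rows.append(self._cells)
            self._cells = None
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and self._cells:
            self._cells[-1] += data


def extract_metadata(html):
    """Return the key/value rows whose key is listed in METADATA."""
    parser = DetailsTableParser()
    parser.feed(html)
    parser.close()
    data = {}
    for cells in parser.rows:
        if len(cells) != 2:
            continue  # not a key/value row
        key, value = (cell.strip() for cell in cells)
        if key in METADATA:
            data[key] = value
    return data


print(extract_metadata(SAMPLE_HTML))
```

The Maintainer row is dropped because its key is not in METADATA.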

In [12]:
def packages_list(category, version):
    """
    Return a list of available packages for the given version in the given category.
    """
    content = requests.get(JSON_LIST[category].format(version=version)).content
    # Remove variable declaration
    _, content = content[:-1].split(' = ', 1)
    content = json.loads(content)
    return map(lambda x: x[0], content['content'])
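The string handling in packages_list can be tried offline. The sketch below assumes the file is a JavaScript assignment of the form "var name = {...};", which is what the split on ' = ' and the removal of the last character imply. SAMPLE_JS and package_names are hypothetical, and the payload is invented for illustration.

```python
import json

# Invented payload, shaped like what packages_list expects (an assumption).
SAMPLE_JS = (
    'var bioc_packages = {"content": ['
    '["pkgA", "Maintainer A", "Title A"], '
    '["pkgB", "Maintainer B", "Title B"]]};'
)


def package_names(js_source):
    """Strip the 'var name = ' prefix and trailing ';', then parse the JSON."""
    # strip() and rstrip(';') also tolerate a trailing newline,
    # which a plain [:-1] slice would not.
    _, payload = js_source.strip().rstrip(';').split(' = ', 1)
    content = json.loads(payload)
    # The first item of each row is the package name.
    return [row[0] for row in content['content']]


print(package_names(SAMPLE_JS))
```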

We now have everything we need to retrieve all the data from BioConductor:

  1. Call packages_list for every category in CATEGORIES and every version in VERSIONS. Each call returns a list of package names.
  2. For every package, retrieve its meta-data using metadata_for_packages.
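The two steps above can be sketched offline as follows. The network-backed helpers are replaced by hypothetical stubs (fake_packages_list and fake_metadata), and the built-in map stands in for the load-balanced map of the cluster, so this only illustrates how the arguments line up, not the parallel execution itself.

```python
from itertools import repeat


# Hypothetical stubs, so that the sketch runs without network access.
def fake_packages_list(category, version):
    return ['pkgA', 'pkgB']


def fake_metadata(category, name, version):
    return {'Version': '1.0.0', 'License': 'GPL-2'}


def get_row(category, release_date, version, package):
    """Merge the package meta-data with the BioConductor release information."""
    row = dict(fake_metadata(category, package, version))
    row.update(Package=package, BiocVersion=version,
               BiocDate=release_date, BiocCategory=category)
    return row


versions = [('2.14', '2014-04-14'), ('3.0', '2014-10-14')]
categories = ['Software', 'AnnotationData']

rows = []
for version, release_date in versions:
    for category in categories:
        packages = fake_packages_list(category, version)
        n = len(packages)
        # repeat(x, n) supplies the same constant for each of the n packages,
        # so map calls get_row once per package.
        rows += list(map(get_row, repeat(category, n), repeat(release_date, n),
                         repeat(version, n), packages))

print(len(rows))
```

Each element of rows is a flat dictionary, which is the shape a pandas.DataFrame constructor accepts for a list of records.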

In [13]:
def get_data_for(category, date, version, package):
    pkg_data = metadata_for_packages(category, package, version)
    pkg_data['Package'] = package
    pkg_data['BiocVersion'] = version
    pkg_data['BiocDate'] = date
    pkg_data['BiocCategory'] = category
    return pkg_data


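data = []
# The engines need the same imports as metadata_for_packages.
clients[:].execute('import requests')
clients[:].execute('import BeautifulSoup as bs')

# Push the function and the constants it reads into the namespace of every engine.
export = ['metadata_for_packages', 'PACKAGE_DETAILS', 'METADATA']
for name in export:
    clients[:][name] = globals()[name]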
    
balanced = clients.load_balanced_view()


# release_date avoids shadowing datetime.date, which is imported above as date.
for version, release_date in VERSIONS:
    print 'BioConductor version', version
    for category in CATEGORIES:
        packages = packages_list(category, version)
        n = len(packages)
        print 'Version', version, '-', n, 'items retrieved for', category
        new_data = balanced.map(get_data_for, repeat(category, n), repeat(release_date, n), repeat(version, n), packages)
        data += new_data


BioConductor version 2.5
Version 2.5 - 353 items retrieved for Software
Version 2.5 - 482 items retrieved for AnnotationData
Version 2.5 - 76 items retrieved for ExperimentData
BioConductor version 2.6
Version 2.6 - 389 items retrieved for Software
Version 2.6 - 501 items retrieved for AnnotationData
Version 2.6 - 83 items retrieved for ExperimentData
BioConductor version 2.7
Version 2.7 - 419 items retrieved for Software
Version 2.7 - 519 items retrieved for AnnotationData
Version 2.7 - 85 items retrieved for ExperimentData
BioConductor version 2.8
Version 2.8 - 467 items retrieved for Software
Version 2.8 - 595 items retrieved for AnnotationData
Version 2.8 - 102 items retrieved for ExperimentData
BioConductor version 2.9
Version 2.9 - 514 items retrieved for Software
Version 2.9 - 601 items retrieved for AnnotationData
Version 2.9 - 118 items retrieved for ExperimentData
BioConductor version 2.10
Version 2.10 - 553 items retrieved for Software
Version 2.10 - 626 items retrieved for AnnotationData
Version 2.10 - 126 items retrieved for ExperimentData
BioConductor version 2.11
Version 2.11 - 608 items retrieved for Software
Version 2.11 - 668 items retrieved for AnnotationData
Version 2.11 - 137 items retrieved for ExperimentData
BioConductor version 2.12
Version 2.12 - 672 items retrieved for Software
Version 2.12 - 690 items retrieved for AnnotationData
Version 2.12 - 157 items retrieved for ExperimentData
BioConductor version 2.13
Version 2.13 - 750 items retrieved for Software
Version 2.13 - 698 items retrieved for AnnotationData
Version 2.13 - 181 items retrieved for ExperimentData
BioConductor version 2.14
Version 2.14 - 824 items retrieved for Software
Version 2.14 - 867 items retrieved for AnnotationData
Version 2.14 - 202 items retrieved for ExperimentData
BioConductor version 3.0
Version 3.0 - 936 items retrieved for Software
Version 3.0 - 895 items retrieved for AnnotationData
Version 3.0 - 223 items retrieved for ExperimentData

In [14]:
# Save in .csv file using pandas
df = pandas.DataFrame(data)
df = df[['Package'] + METADATA + ['BiocCategory', 'BiocVersion', 'BiocDate']]
df.to_csv(FILENAME)